QTM 447 Lecture 10: Backpropagation and Gradient Problems

Kevin McAlister

February 13, 2025

Neural Networks

\[ \newcommand\hbb{{\hat{\boldsymbol \beta}}} \newcommand\bb{{\boldsymbol \beta}} \newcommand\expn{{\frac{1}{N} \sum \limits_{i = 1}^N}} \newcommand\sumk{\sum \limits_{k = 1}^K} \newcommand\argminb{\underset{\bb}{\text{argmin }}} \newcommand\argmaxb{\underset{\bb}{\text{argmax }}} \newcommand\gtheta{\mathbf g(\boldsymbol \theta)} \newcommand\htheta{\mathbf H(\boldsymbol \theta)} \]

A deep neural network can be succinctly defined in two equations:

\[\mathcal L(\boldsymbol \theta | \mathbf X , \mathbf y) = \expn \mathcal L(\theta_i | \mathbf x_i , y_i)\]

\[\theta_i = g(\bb^T \varphi(\mathbf W_D^T \varphi(\mathbf W^T_{D - 1} \varphi(\mathbf W^T_{D - 2} ... \varphi(\mathbf W_1^T \mathbf x_i)))))\]

Neural Networks

10 categories!

Loss:

\[-\expn \sum \limits_{k = 1}^K I(y_i = k) \log \text{Pr}(y_i = k)\]

Using a linear model:

\[\theta_{ik} = \sigma_k(\bb_k^T \mathbf x_i)\]

  • The multinomial logistic regression setup with \(K\) (or really \(K - 1\)) sets of coefficients.
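As a reminder of the pieces involved, the softmax map \(\sigma_k\) from \(K\) linear scores to class probabilities can be sketched in a few lines (the scores here are hypothetical, not a fitted model):

```python
import numpy as np

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability;
    # this leaves the resulting probabilities unchanged.
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

# Hypothetical linear scores beta_k^T x_i for K = 3 classes
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)   # nonnegative, sums to 1
```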

Neural Networks

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, accuracy_score

# Create a Logistic Regression model
model = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000, random_state=42)

# Train the model on X_train and y_train
model.fit(X_train, y_train)

# Predict the labels for X_train
y_pred = model.predict(X_train)

# Compute the training error rate
training_error_rate = 1 - accuracy_score(y_train, y_pred)

# Compute the training log loss
y_pred_proba = model.predict_proba(X_train)
training_log_loss = log_loss(y_train, y_pred_proba)

print(f'Training error rate: {training_error_rate}')
print(f'Training log loss: {training_log_loss}')
Training error rate: 0.04654999999999998
Training log loss: 0.17400512869479443

Neural Networks

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss, accuracy_score

# Create a Random Forest model with 100 trees
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model on X_train and y_train
model.fit(X_train, y_train)

# Predict the labels for X_train
y_pred = model.predict(X_train)

# Compute the training error rate
training_error_rate = 1 - accuracy_score(y_train, y_pred)

# Compute the training log loss
y_pred_proba = model.predict_proba(X_train)
training_log_loss = log_loss(y_train, y_pred_proba)

print(f'Training error rate: {training_error_rate}')
print(f'Training log loss: {training_log_loss}')
Training error rate: 0.0
Training log loss: 0.09439185597295903

Neural Networks

Softmax NN w/ \(K\) categories:

Each hidden layer just has one set of coefficients.

\[\mathbf h_{ij} = \varphi(\mathbf W_j^T \mathbf h_{i(j-1)})\]

At the final layer (the output layer), create \(K\) sets of weights:

\[\theta_{ik} = \frac{\exp[\bb_k^T \mathbf h_{ij}]}{\sum \limits_{g = 1}^K \exp[\bb_g^T \mathbf h_{ij}]}\]

  • Each hidden layer is engineering a new feature set

  • The final feature set is used to train a multinomial logistic regression
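A minimal numpy sketch of this forward pass, with hypothetical dimensions and random placeholder weights (one hidden layer for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(v):
    return np.maximum(v, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical sizes: P = 4 inputs, 5 hidden units, K = 3 classes
x_i = rng.normal(size=4)
W1 = rng.normal(size=(4, 5))   # one set of coefficients for the hidden layer
B = rng.normal(size=(5, 3))    # K sets of output weights (one column per class)

h = relu(W1.T @ x_i)           # engineered feature set
theta = softmax(B.T @ h)       # multinomial logistic regression on h
print(theta)                   # K class probabilities
```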

Neural Networks

import pytorch_lightning as pl
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import pandas as pd
from torchmetrics import Accuracy, MeanMetric
from pytorch_lightning.callbacks import RichProgressBar

class MNISTClassifier(pl.LightningModule):
    def __init__(self, hidden_units: list = [256], lr: float = 1e-3):
        """
        Args:
            hidden_units (list): A list where each element specifies the number of units in that hidden layer.
            lr (float): Initial learning rate.
        """
        super(MNISTClassifier, self).__init__()
        
        layers = []
        input_dim = 784  # flattened image (28x28)
        # Build hidden layers dynamically
        for units in hidden_units:
            layers.append(nn.Linear(input_dim, units))
            layers.append(nn.ReLU())
            input_dim = units
        # Final output layer: maps last hidden layer to 10 classes
        layers.append(nn.Linear(input_dim, 10))
        self.model = nn.Sequential(*layers)
        
        # Loss function (cross entropy = log loss)
        self.criterion = nn.CrossEntropyLoss()

        # Learning rate
        self.lr = lr
    
    def forward(self, x):
        return self.model(x)
    
    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = self.criterion(logits, y)
        preds = torch.argmax(logits, dim=1)
        
        # Compute batch accuracy and then inaccuracy (1 - accuracy)
        batch_accuracy = (preds == y).float().mean()
        batch_inaccuracy = 1 - batch_accuracy
        
        # Log both loss and inaccuracy on every training step
        self.log('train_loss', loss, on_step=True, on_epoch=True, prog_bar=True)
        self.log('train_inaccuracy', batch_inaccuracy, on_step=True, on_epoch=True, prog_bar=True)
        return loss
    
    def configure_optimizers(self):
        # Use Adam optimizer.
        optimizer = torch.optim.Adam(self.parameters(), lr=self.lr)
        return optimizer
    
    def predict_step(self, batch, batch_idx):
        # Return a dict with:
        # - probabilities: a list of probabilities per class,
        # - predicted: a list of predicted class indices via argmax,
        # - target: the correct class labels.
        x, y = batch
        logits = self(x)
        probs = F.softmax(logits, dim=1)
        preds = torch.argmax(logits, dim=1)
        return {
            "probabilities": probs.tolist(),  # list of lists (one per instance)
            "predicted": preds.tolist(),        # list of predicted class indices
            "target": y.tolist()                # list of true labels
        }

Neural Networks

# -------------------- Data Preparation --------------------
# Load CSV data (images are already flattened to 784 features)
X_train = pd.read_csv('X_train_MNIST.csv').to_numpy()  # shape: (N, 784)
y_train = pd.read_csv('y_train_MNIST.csv').to_numpy().flatten()  # shape: (N,)

# Convert to PyTorch tensors
X_train_tensor = torch.tensor(X_train/255, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.long)

# Create dataset and DataLoader
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_dataset, batch_size=10000, shuffle=True)

Neural Networks

# -------------------- Model Initialization & Learning Rate Tuning --------------------
# For example, use one hidden layer with 128 units.
model = MNISTClassifier(hidden_units=[128], lr=.01)

# Create a Lightning Trainer with a rich progress bar callback.
trainer = pl.Trainer(max_epochs=100, callbacks=[RichProgressBar()])

# -------------------- Train the Model --------------------
trainer.fit(model, train_loader)

# -------------------- Compute Metrics Using the Predictions Dictionary --------------------
# Use Lightning's built-in predict method.
predictions = trainer.predict(model, dataloaders=train_loader)
# 'predictions' is a list of dictionaries (one per batch) with:
#   - "probabilities": list of lists (each inner list holds class probabilities),
#   - "predicted": list of predicted class indices,
#   - "target": list of true labels.

# Aggregate predictions from all batches.
all_probs = []
all_preds = []
all_targets = []

for batch_pred in predictions:
    all_probs.extend(batch_pred["probabilities"])
    all_preds.extend(batch_pred["predicted"])
    all_targets.extend(batch_pred["target"])

# Convert lists to tensors.
all_probs_tensor = torch.tensor(all_probs)    # shape: (N, 10)
all_preds_tensor = torch.tensor(all_preds)      # shape: (N,)
all_targets_tensor = torch.tensor(all_targets)  # shape: (N,)

# Compute the cross entropy loss (log loss) for each instance:
# For each instance, loss = -log(probability of the correct class)
nll = -torch.log(all_probs_tensor[torch.arange(all_probs_tensor.size(0)), all_targets_tensor])
final_log_loss = nll.mean()

# Compute the inaccuracy rate.
accuracy = (all_preds_tensor == all_targets_tensor).float().mean()
final_inaccuracy_rate = 1 - accuracy

# -------------------- Print Final Metrics --------------------
print(f"Final Training Log Loss (Cross Entropy): {final_log_loss.item():.4f}")
print(f"Final Training Inaccuracy Rate: {final_inaccuracy_rate.item() * 100:.2f}%")

Neural Networks

┏━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃   ┃ Name      ┃ Type             ┃ Params ┃ Mode  ┃
┡━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ 0 │ model     │ Sequential       │  101 K │ train │
│ 1 │ criterion │ CrossEntropyLoss │      0 │ train │
└───┴───────────┴──────────────────┴────────┴───────┘
Trainable params: 101 K                                                                                            
Non-trainable params: 0                                                                                            
Total params: 101 K                                                                                                
Total estimated model params size (MB): 0                                                                          




Final Training Log Loss (Cross Entropy): 0.0089
Final Training Inaccuracy Rate: 0.02%

Neural Networks

# -------------------- Model Initialization & Learning Rate Tuning --------------------
# For example, use two hidden layers of 128 units each.
model = MNISTClassifier(hidden_units=[128,128], lr=.01)

# Create a Lightning Trainer with a rich progress bar callback.
trainer = pl.Trainer(max_epochs=100, callbacks=[RichProgressBar()])

# -------------------- Train the Model --------------------
trainer.fit(model, train_loader)

# -------------------- Compute Metrics Using the Predictions Dictionary --------------------
# Use Lightning's built-in predict method.
predictions = trainer.predict(model, dataloaders=train_loader)
# 'predictions' is a list of dictionaries (one per batch) with:
#   - "probabilities": list of lists (each inner list holds class probabilities),
#   - "predicted": list of predicted class indices,
#   - "target": list of true labels.

# Aggregate predictions from all batches.
all_probs = []
all_preds = []
all_targets = []

for batch_pred in predictions:
    all_probs.extend(batch_pred["probabilities"])
    all_preds.extend(batch_pred["predicted"])
    all_targets.extend(batch_pred["target"])

# Convert lists to tensors.
all_probs_tensor = torch.tensor(all_probs)    # shape: (N, 10)
all_preds_tensor = torch.tensor(all_preds)      # shape: (N,)
all_targets_tensor = torch.tensor(all_targets)  # shape: (N,)

# Compute the cross entropy loss (log loss) for each instance:
# For each instance, loss = -log(probability of the correct class)
nll = -torch.log(all_probs_tensor[torch.arange(all_probs_tensor.size(0)), all_targets_tensor])
final_log_loss = nll.mean()

# Compute the inaccuracy rate.
accuracy = (all_preds_tensor == all_targets_tensor).float().mean()
final_inaccuracy_rate = 1 - accuracy

# -------------------- Print Final Metrics --------------------
print(f"Final Training Log Loss (Cross Entropy): {final_log_loss.item():.4f}")
print(f"Final Training Inaccuracy Rate: {final_inaccuracy_rate.item() * 100:.2f}%")

Neural Networks

┏━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃   ┃ Name      ┃ Type             ┃ Params ┃ Mode  ┃
┡━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ 0 │ model     │ Sequential       │  118 K │ train │
│ 1 │ criterion │ CrossEntropyLoss │      0 │ train │
└───┴───────────┴──────────────────┴────────┴───────┘
Trainable params: 118 K                                                                                            
Non-trainable params: 0                                                                                            
Total params: 118 K                                                                                                
Total estimated model params size (MB): 0                                                                          




Final Training Log Loss (Cross Entropy): 0.0009
Final Training Inaccuracy Rate: 0.00%

Neural Networks

# -------------------- Model Initialization & Learning Rate Tuning --------------------
# For example, use three hidden layers of 128 units each.
model = MNISTClassifier(hidden_units=[128,128,128], lr=.01)

# Create a Lightning Trainer with a rich progress bar callback.
trainer = pl.Trainer(max_epochs=100, callbacks=[RichProgressBar()])

# -------------------- Train the Model --------------------
trainer.fit(model, train_loader)

# -------------------- Compute Metrics Using the Predictions Dictionary --------------------
# Use Lightning's built-in predict method.
predictions = trainer.predict(model, dataloaders=train_loader)
# 'predictions' is a list of dictionaries (one per batch) with:
#   - "probabilities": list of lists (each inner list holds class probabilities),
#   - "predicted": list of predicted class indices,
#   - "target": list of true labels.

# Aggregate predictions from all batches.
all_probs = []
all_preds = []
all_targets = []

for batch_pred in predictions:
    all_probs.extend(batch_pred["probabilities"])
    all_preds.extend(batch_pred["predicted"])
    all_targets.extend(batch_pred["target"])

# Convert lists to tensors.
all_probs_tensor = torch.tensor(all_probs)    # shape: (N, 10)
all_preds_tensor = torch.tensor(all_preds)      # shape: (N,)
all_targets_tensor = torch.tensor(all_targets)  # shape: (N,)

# Compute the cross entropy loss (log loss) for each instance:
# For each instance, loss = -log(probability of the correct class)
nll = -torch.log(all_probs_tensor[torch.arange(all_probs_tensor.size(0)), all_targets_tensor])
final_log_loss = nll.mean()

# Compute the inaccuracy rate.
accuracy = (all_preds_tensor == all_targets_tensor).float().mean()
final_inaccuracy_rate = 1 - accuracy

# -------------------- Print Final Metrics --------------------
print(f"Final Training Log Loss (Cross Entropy): {final_log_loss.item():.4f}")
print(f"Final Training Inaccuracy Rate: {final_inaccuracy_rate.item() * 100:.2f}%")

Neural Networks

┏━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃   ┃ Name      ┃ Type             ┃ Params ┃ Mode  ┃
┡━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ 0 │ model     │ Sequential       │  134 K │ train │
│ 1 │ criterion │ CrossEntropyLoss │      0 │ train │
└───┴───────────┴──────────────────┴────────┴───────┘
Trainable params: 134 K                                                                                            
Non-trainable params: 0                                                                                            
Total params: 134 K                                                                                                
Total estimated model params size (MB): 0                                                                          




Final Training Log Loss (Cross Entropy): 0.0002
Final Training Inaccuracy Rate: 0.00%

Neural Networks

With enough hidden units and layers, we can interpolate even a complicated training set arbitrarily well

  • Pretty quick with PyTorch (roughly 20 seconds to train with 100 epochs and a batch size of 10,000)

Note that the random forest performed just as well for this tabular data

  • This is generally going to be true when the training data is a single outcome with a table of features!

  • Better generalization, usually

Neural Networks

The real magic of this is that we were able to construct a 3 layer NN with ReLU activations, a softmax output layer, and cross entropy loss in roughly 10 lines of code!

  • It just took in the model instructions and trained our specific neural network with very little additional input needed

  • PyTorch is quite verbose in its setup, but this has more to do with how we train the model

  • The actual model specification is all contained in that init step.

Neural Networks

This isn’t a trivial task

  • Specify the model

  • Compute the gradients

  • Set up the ADAM optimizer

  • Run to convergence

Neural Networks

PyTorch is modular

  • I can add a new layer with any activation function easily with one new line of code

  • It just knows what to do!

The question is: how is it able to train models with so little code?

Neural Networks

The method used by modern NN architectures is called automatic differentiation (autodiff)

  • Specify the model as a computational graph that maps inputs to outputs through unknowns and simple primitive operations

  • For each node in the graph, compute the simple local gradient with respect to knowns and unknowns

  • Use backpropagation to combine the local gradients in an efficient way to get the gradient of the loss w.r.t. any local unknowns

  • Use SGD to minimize training loss w.r.t. the inputs and loss structure!
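A toy version of this pipeline fits in a few lines of pure Python (an illustrative sketch, not how PyTorch is implemented internally): each node in the graph stores its value plus the local gradients to its parents, and the backward pass combines them via the chain rule.

```python
import math

# Minimal reverse-mode autodiff: nodes hold a value and
# (parent, local gradient) pairs; backward() applies the chain rule.
class Node:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs of (parent node, local gradient)
        self.grad = 0.0

    def backward(self, upstream=1.0):
        self.grad += upstream
        for parent, local in self.parents:
            # downstream gradient = local gradient x upstream gradient
            parent.backward(local * upstream)

def mul(a, b):
    return Node(a.value * b.value, [(a, b.value), (b, a.value)])

def add(a, b):
    return Node(a.value + b.value, [(a, 1.0), (b, 1.0)])

def sigmoid(a):
    s = 1.0 / (1.0 + math.exp(-a.value))
    return Node(s, [(a, s * (1.0 - s))])

# The logistic-regression core: out = sigmoid(w * x + b)
w, x, b = Node(2.0), Node(3.0), Node(0.5)
out = sigmoid(add(mul(w, x), b))
out.backward()
print(w.grad, b.grad)   # sigma'(z) * x and sigma'(z), with z = 6.5
```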

Computational Graphs

Computational Graph for Logistic Regression:

Computational Graphs

Two primitive operations:

  • Multiplication

  • Addition

One complex operation:

  • Sigmoid: \(\sigma(x) = \frac{1}{1 + \exp[-x]}\)

Two unknowns:

  • \(\mathbf W\) - a \(2 \times 1\) vector of coefficients

  • \(b\) - a single bias term

Computational Graphs

Operations:

  1. Matrix-multiply X and W to get q

  2. Add b to q to get z

  3. Compute the sigmoid function for z

  4. Combine with the labels and average to get the loss (cross-entropy)

A collection of simple operations arranged in an acyclic graph
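These steps translate directly into code; a numpy sketch with made-up data (the graph is the one above, the numbers are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))    # 5 instances, 2 features
y = np.array([0., 1., 1., 0., 1.])
W = np.array([0.3, -0.2])      # the 2 x 1 weight vector (flat here)
b = 0.1                        # the single bias term

q = X @ W                      # 1. multiply X and W to get q
z = q + b                      # 2. add b to q to get z
p = 1 / (1 + np.exp(-z))       # 3. sigmoid of z
loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))  # 4. cross-entropy
print(loss)
```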

Computational Graphs

Gradient descent:

  • Start by setting initial values for the unknowns (W and b)

  • Use current values to compute intermediate values (q and z)

  • Evaluate the gradient at the current values

  • Update the unknowns

  • Repeat a lot until convergence
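Putting those steps together for the logistic-regression graph (a sketch with simulated data; the gradient expressions used in the loop are the familiar logistic ones derived later in these notes):

```python
import numpy as np

# Simulate data from a known logistic model
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2))
true_W = np.array([1.5, -2.0])
p_true = 1 / (1 + np.exp(-(X @ true_W + 0.3)))
y = (rng.uniform(size=500) < p_true).astype(float)

W = np.zeros(2)   # initial values for the unknowns
b = 0.0
lr = 0.5

for _ in range(500):
    z = X @ W + b                    # forward pass: set intermediates
    p = 1 / (1 + np.exp(-z))
    grad_z = p - y                   # gradient of loss w.r.t. z
    grad_W = X.T @ grad_z / len(y)   # gradient w.r.t. W
    grad_b = grad_z.mean()           # gradient w.r.t. b
    W -= lr * grad_W                 # update the unknowns
    b -= lr * grad_b                 # ...and repeat a lot

print(W, b)   # should land near the simulating values
```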

Computational Graphs

The process of using the current values of the unknowns to set the intermediates is referred to as the forward pass

  • From the bottom of the graph to the top, fill in any values!

  • Moving through the graph in the forward direction.

Computational Graphs

The next step is to find the gradient of the loss w.r.t. the unknowns.

Two unknowns:

\[\frac{\partial \mathcal L}{\partial W} \text{ ; } \frac{\partial \mathcal L}{\partial b}\]

We know that we can get these gradients using the chain rule

\[\frac{\partial \mathcal L}{\partial W} = \frac{\partial \mathcal L}{\partial z}\frac{\partial z}{\partial q}\frac{\partial q}{\partial W}\]

\[\frac{\partial \mathcal L}{\partial b} = \frac{\partial \mathcal L}{\partial z}\frac{\partial z}{\partial b}\]

  • Note that both of the unknowns share upstream gradients

Computational Graphs

For logistic regression, the loss layer is cross-entropy:

\[\mathcal L(z | \mathbf X , \mathbf y) = -\expn \left[ y_i \log \sigma(z_i) + (1 - y_i) \log (1 - \sigma(z_i)) \right]\]

To make this easier, we’re just going to look at a single instance and note that the gradients at each step are an average over the instances.

First step, find the derivative of the loss w.r.t. \(z_i\).

Computational Graphs

We’ve seen this a lot, so we won’t re-derive. The gradient for a cross-entropy layer with 2 categories is always a scalar:

\[\frac{\partial \mathcal L_i}{\partial z_i} = \sigma(z_i) - y_i\]

where \(\sigma(z_i)\) is:

\[\frac{1}{1 + \exp[-z_i]}\]
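A quick finite-difference check of this derivative (a sketch; the point values are arbitrary):

```python
import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

def ce_loss(z, y):
    # Cross-entropy for a single instance
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

z, y = 0.8, 1.0
analytic = sigmoid(z) - y   # the derived gradient
eps = 1e-6
numeric = (ce_loss(z + eps, y) - ce_loss(z - eps, y)) / (2 * eps)
print(analytic, numeric)    # should agree closely
```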

Computational Graphs

Let’s move down the chain for \(\mathbf W\). Next, we want to find:

\[\frac{\partial \mathcal L_i}{\partial q_i} = \frac{\partial \mathcal L_i}{\partial z_i}\frac{\partial z_i}{\partial q_i}\]

  • The gradient for this variable is a product of the upstream gradient - \(\frac{\partial \mathcal L_i}{\partial z_i}\) - and the local gradient - \(\frac{\partial z_i}{\partial q_i}\)

\[z_i = q_i + b\]

Computational Graphs

The derivative here evaluates to 1, so:

\[\frac{\partial \mathcal L_i}{\partial q_i} = (\sigma(z_i) - y_i) \times 1\]

We also have the bias term in this step! So, we can find the gradient w.r.t. one of our unknowns:

\[\frac{\partial \mathcal L_i}{\partial b} = \frac{\partial \mathcal L_i}{\partial z_i}\frac{\partial z_i}{\partial b} = (\sigma(z_i) - y_i) \times 1\]

\[\frac{\partial \mathcal L}{\partial b} = \expn \frac{\partial \mathcal L_i}{\partial z_i}\frac{\partial z_i}{\partial b} = \expn (\sigma(z_i) - y_i)\]

Computational Graphs

Finally, we have

\[\frac{\partial \mathcal L_i}{\partial W} = \frac{\partial \mathcal L_i}{\partial q_i} \frac{\partial q_i}{\partial W}\]

\(\mathbf W\) is a \(P\) vector and \(q_i\) is a scalar.

\(\frac{\partial q_i}{\partial W}\) is a P-vector gradient

  • Element \(j\) of the gradient corresponds to how much \(q_i\) changes if we change \(w_j\) just a little.

Computational Graphs

Using the rules of vector derivatives, we have:

\[q_i = \mathbf W^T \mathbf x_i\]

\[\frac{\partial q_i}{\partial W} = \mathbf x_i\]

\[\frac{\partial \mathcal L_i}{\partial W} = (\sigma(z_i) - y_i)\mathbf x_i\]

\[\frac{\partial \mathcal L}{\partial W} = \expn (\sigma(z_i) - y_i)\mathbf x_i\]

which we recognize as the gradient for the coefficients in logistic regression!
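This full gradient can also be verified numerically (a sketch with simulated data):

```python
import numpy as np

def nll(W, X, y):
    # Average cross-entropy loss for logistic regression
    p = 1 / (1 + np.exp(-(X @ W)))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))
y = rng.integers(0, 2, size=50).astype(float)
W = rng.normal(size=3)

p = 1 / (1 + np.exp(-(X @ W)))
analytic = X.T @ (p - y) / len(y)   # mean over i of (sigma(z_i) - y_i) x_i

# Central-difference check, one coordinate at a time
eps = 1e-6
numeric = np.array([(nll(W + eps * e, X, y) - nll(W - eps * e, X, y)) / (2 * eps)
                    for e in np.eye(3)])
print(analytic, numeric)
```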

Computational Graphs

This is no different than what we’ve done before!

  • Just treated very mechanically

Algorithmically, even.

Computational Graphs

Backpropagation is an autodiff algorithm that has two steps:

  • Forward pass: set the values of any intermediate variables given the unknowns

  • Backward pass: compute the gradient given the intermediate values

Most of the hard work is done in the backwards pass!

Leverage the chain rule:

\[\text{Downstream Grad} = \text{Local Grad} \times \text{Upstream Grad}\]

\[\frac{\partial \mathcal L}{\partial x} = \frac{\partial \mathcal L}{\partial q} \frac{\partial q}{\partial x}\]

Computational Graphs

Now, let’s think about setting up a graph for a two-layer NN with \(K_1\) hidden units in the first layer and \(K_2\) hidden units in the second layer.

  • The graph gets a little too unruly to represent via the nice dot plots, so we’ll just use equations here.

Computational Graphs

\[\mathcal L(\mathbf z | \mathbf X, \mathbf y) = -\expn \left[ y_i \log \sigma(z_i) + (1 - y_i) \log (1 - \sigma(z_i)) \right]\]

\[\underset{(1 \times 1)}{z_i} = \underset{(1 \times K_2)}{\bb^T} \underset{(K_2 \times 1)}{\mathbf h_{i2}} + \underset{(1 \times 1)}{a}\]

\[\underset{(K_2 \times 1)}{\mathbf h_{i2}} = \varphi(\underset{(K_2 \times 1)}{\mathbf q_{i2}})\]

\[\underset{(K_2 \times 1)}{\mathbf q_{i2}} = \underset{(K_2 \times K_1)}{\mathbf W_2^T}\underset{(K_1 \times 1)}{\mathbf h_{i1}} + \underset{(\mathbf K_2 \times 1)}{\mathbf b_2}\]

\[\underset{(K_1 \times 1)}{\mathbf h_{i1}} = \varphi(\underset{(K_1 \times 1)}{\mathbf q_{i1}})\]

\[\underset{(K_1 \times 1)}{\mathbf q_{i1}} = \underset{(K_1 \times P)}{\mathbf W_1^T}\underset{(P \times 1)}{\mathbf x_i} + \underset{(\mathbf K_1 \times 1)}{\mathbf b_1}\]
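Each annotated shape above can be checked directly with random placeholder weights (hypothetical small sizes):

```python
import numpy as np

rng = np.random.default_rng(8)
P, K1, K2 = 5, 4, 3
x = rng.normal(size=(P, 1))
W1 = rng.normal(size=(P, K1)); b1 = rng.normal(size=(K1, 1))
W2 = rng.normal(size=(K1, K2)); b2 = rng.normal(size=(K2, 1))
beta = rng.normal(size=(K2, 1)); a = 0.2

q1 = W1.T @ x + b1        # (K1 x 1)
h1 = np.maximum(q1, 0)    # (K1 x 1), ReLU
q2 = W2.T @ h1 + b2       # (K2 x 1)
h2 = np.maximum(q2, 0)    # (K2 x 1), ReLU
z = beta.T @ h2 + a       # (1 x 1)
print(q1.shape, q2.shape, z.shape)
```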

Computational Graphs

Unknowns:

  • Intercepts: \(a\), \(\mathbf b_2\), and \(\mathbf b_1\)

  • Weights: \(\boldsymbol \beta\), \(\mathbf W_2\), \(\mathbf W_1\)

Gradients to compute:

\(\frac{\partial \mathcal L_i}{\partial a}\), \(\frac{\partial \mathcal L_i}{\partial \mathbf b_2}\), \(\frac{\partial \mathcal L_i}{\partial \mathbf b_1}\), \(\frac{\partial \mathcal L_i}{\partial \boldsymbol \beta}\), \(\frac{\partial \mathcal L_i}{\partial \mathbf W_2}\), \(\frac{\partial \mathcal L_i}{\partial \mathbf W_1}\)

Computational Graphs

Starting at the top:

\[\frac{\partial \mathcal L_i}{\partial z_i} = \underset{(1 \times 1)}{\sigma(z_i) - y_i}\]

\[\underset{(1 \times 1)}{z_i} = \underset{(1 \times K_2)}{\bb^T} \underset{(K_2 \times 1)}{\mathbf h_{i2}} + \underset{(1 \times 1)}{a}\]

For \(z_i\), three derivatives:

  • \(\frac{\partial z_i}{\partial \boldsymbol \beta} = \underset{(1 \times K_2)}{\mathbf h_{i2}^T}\)

  • \(\frac{\partial z_i}{\partial a} = \underset{(1 \times 1)}{1}\)

  • \(\frac{\partial z_i}{\partial \mathbf h_{i2}} = \underset{(1 \times K_2)}{\bb^T}\)

Computational Graphs

At this point, three chains of derivatives:

\[\frac{\partial \mathcal L_i}{\partial \boldsymbol \beta} = \underset{(1 \times 1)}{[\sigma(z_i) - y_i]} \otimes \underset{(1 \times K_2)}{\mathbf h_{i2}^T}\]

\[\frac{\partial \mathcal L_i}{\partial a} = \underset{(1 \times 1)}{[\sigma(z_i) - y_i]} \otimes \underset{(1 \times 1)}{1}\]

\[\frac{\partial \mathcal L_i}{\partial \mathbf h_{i2}} = \underset{(1 \times 1)}{[\sigma(z_i) - y_i]} \otimes \underset{(1 \times K_2)}{\bb^T}\]

  • The chains for \(\boldsymbol \beta\) and \(a\) are done!

  • 2 out of 6 complete!

Computational Graphs

\[\underset{(K_2 \times 1)}{\mathbf h_{i2}} = \varphi(\underset{(K_2 \times 1)}{\mathbf q_{i2}})\]

We want to find the derivative of a vector w.r.t. a vector.

Vector derivatives w.r.t. a vector yield a Jacobian:

\[\mathbf h_{i2} \in \mathbb R^{{K_2}} \text{ ; } \mathbf q_{i2} \in \mathbb R^{{K_2}}\]

\[\mathbf J_{jk} = \frac{\partial h_{i2,k}}{\partial q_{i2,j}}\]

  • For a small change in element \(j\) of \(\mathbf q\), how much does element \(k\) of \(\mathbf h\) change?

  • Yields a \(K_2 \times K_2\) matrix of derivatives!

Computational Graphs

Shortcut:

Recall that the ReLU activation function is applied elementwise

If \(j \neq k\), does a change in \(q\) result in a change in \(h\)?

\[\underset{(K_2 \times 1)}{\mathbf h_{i2}} = \varphi(\underset{(K_2 \times 1)}{\mathbf q_{i2}})\]

Nope! Only a change in the corresponding element results in a change!

  • Therefore, all of the off-diagonals of the Jacobian are 0.

Computational Graphs

Diagonals:

  • If \(q_{i2,j} > 0\), \(\frac{\partial h_{i2,j}}{\partial q_{i2,j}} = 1\)
  • If \(q_{i2,j} < 0\), \(\frac{\partial h_{i2,j}}{\partial q_{i2,j}} = 0\)

(Just pretend this is never equal to zero…)

\[ \frac{\partial \mathbf h_{i2}}{\partial \mathbf q_{i2}} = \begin{bmatrix} I(q_{i2,1} > 0) & 0 & \ldots & 0\\ 0 & I(q_{i2,2} > 0) & \ldots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \ldots & I(q_{i2,K_2} > 0)\\ \end{bmatrix} = \operatorname{diag}(I(\mathbf q_{i2} > 0)) \]
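Because the Jacobian is diagonal, multiplying the upstream gradient by it reduces to an elementwise product with the 0/1 mask \(I(\mathbf q > 0)\); a quick numpy check (arbitrary values):

```python
import numpy as np

rng = np.random.default_rng(4)
q = rng.normal(size=6)
upstream = rng.normal(size=6)        # some upstream gradient row vector

J = np.diag((q > 0).astype(float))   # the full K x K ReLU Jacobian
full = upstream @ J                  # multiply by the Jacobian...
masked = upstream * (q > 0)          # ...or just mask elementwise
print(np.allclose(full, masked))     # True
```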

Computational Graphs

Since it’s a diagonal matrix, we can think of applying this derivative as an elementwise operation:

\[\frac{\partial \mathcal L_i}{\partial \mathbf q_{i2}} = \underset{(1 \times 1)}{[\sigma(z_i) - y_i]} \otimes \underset{(1 \times K_2)}{\bb^T} \odot \underset{(1 \times K_2)}{I(\mathbf q_{i2} > 0)}\]

  • Down the chain we go!

Computational Graphs

\[\underset{(K_2 \times 1)}{\mathbf q_{i2}} = \underset{(K_2 \times K_1)}{\mathbf W_2^T}\underset{(K_1 \times 1)}{\mathbf h_{i1}} + \underset{(K_2 \times 1)}{\mathbf b_2}\]

Three derivatives here:

  • \(\frac{\partial \mathbf q_{i2}}{\partial \mathbf W_2}\) (Terminal)

  • \(\frac{\partial \mathbf q_{i2}}{\partial \mathbf h_{i1}}\)

  • \(\frac{\partial \mathbf q_{i2}}{\partial \mathbf b_2}\) (Terminal)

This is by far the trickiest step (but will always end up being the same thing)

Computational Graphs

Let’s start with the easiest one - \(\frac{\partial \mathbf q_{i2}}{\partial \mathbf b_2}\)

  • Each one is a \(K_2\) vector

  • As with elementwise ReLU, the \(k^{th}\) element of \(\mathbf b_2\) only changes the \(k^{th}\) element of \(\mathbf q_{i2}\)

Vector-vector derivative yields a \(K_2 \times K_2\) Jacobian, so:

\[ \frac{\partial \mathbf q_{i2}}{\partial \mathbf b_2} = \begin{bmatrix} 1 & 0 & \ldots & 0\\ 0 & 1 & \ldots & 0\\ \vdots & \vdots & \ddots & 0\\ 0 & 0 & \ldots & 1\\ \end{bmatrix} \]

\[\frac{\partial \mathcal L_i}{\partial \mathbf b_2} = \underset{(1 \times 1)}{[\sigma(z_i) - y_i]} \otimes \underset{(1 \times K_2)}{\bb^T} \odot \underset{(1 \times K_2)}{I(\mathbf q_{i2} > 0)} \otimes \underset{(K_2 \times K_2)}{\mathcal I_{K_2}}\]

Computational Graphs

Now, let’s work on \(\frac{\partial \mathbf q_{i2}}{\partial \mathbf h_{i1}}\)

  • Remember that \(\mathbf h_{i1}\) is the post-ReLU representation of observation \(i\) in the first hidden space!

\[\underset{(K_2 \times 1)}{\mathbf q_{i2}} = \underset{(K_2 \times K_1)}{\mathbf W_2^T}\underset{(K_1 \times 1)}{\mathbf h_{i1}} + \underset{(K_2 \times 1)}{\mathbf b_2}\]

A general matrix derivative rule:

\[\frac{d}{d \mathbf x} \mathbf A \mathbf x = \mathbf A\]

\[\frac{\partial \mathbf q_{i2}}{\partial \mathbf h_{i1}} = \mathbf W_2^T\]

  • No need to reprove a well-known derivative rule!
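The rule \(\frac{d}{d \mathbf x} \mathbf A \mathbf x = \mathbf A\) is easy to confirm numerically (a sketch with arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(3, 4))
x = rng.normal(size=4)

# Build the Jacobian of q = A x column by column with central differences
eps = 1e-6
J = np.zeros((3, 4))
for j in range(4):
    e = np.zeros(4)
    e[j] = eps
    J[:, j] = (A @ (x + e) - A @ (x - e)) / (2 * eps)

print(np.allclose(J, A, atol=1e-5))  # True
```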

Computational Graphs

The chain:

\[\frac{\partial \mathcal L_i}{\partial \mathbf h_{i1}} = \underset{(1 \times 1)}{[\sigma(z_i) - y_i]} \otimes \underset{(1 \times K_2)}{\bb^T} \odot \underset{(1 \times K_2)}{I(\mathbf q_{i2} > 0)} \otimes \underset{(K_2 \times K_1)}{\mathbf W_2^T}\]

  • We can check our work by thinking about the dimensionality of the Jacobian

  • \(\mathbf h_{i1}\) will be a vector with \(K_1\) values

  • The derivative of the loss w.r.t. \(\mathbf h_{i1}\) is of length \(K_1\)

Computational Graphs

Now the hard one:

\[\frac{\partial \mathbf q_{i2}}{\partial \mathbf W_{2}}\]

This one is tricky, and we can see why by thinking about what the size of the Jacobian will be

  • \(\mathbf q_{i2}\) is a \(K_2 \times 1\) vector and \(\mathbf W_2\) is a \(K_2 \times K_1\) matrix.

  • This will yield a Jacobian of size \(K_2 \times K_2 \times K_1\) - a 3D tensor

  • We know this because the resulting gradient needs to be the same size as \(\mathbf W_2\) - how the loss changes as a function of each element of the weight matrix!

Computational Graphs

We don’t want to deal with this.

  • If \(K_1 = 128\) and \(K_2 = 128\), then this tensor takes roughly 8.4 MB (\(128^3\) floats at 4 bytes each)

  • As these models get larger, we really don’t want to be carrying around many matrices of that size…

  • Also really hard to think about!!!!!

Computational Graphs

A trick:

Let’s think about \(\mathbf q_{i2}\) in sum notation

  • Remember that \(\mathbf q\) is a \(K_2\) vector , \(\mathbf W\) is a \(K_2 \times K_1\) matrix, and \(\mathbf h_1\) is a \(K_1\) vector.

\[\mathbf q_{i2} = \left[\sum \limits_{k = 1}^{K_1} \mathbf W_{1k} h_{k}, \sum \limits_{k = 1}^{K_1} \mathbf W_{2k} h_{k},...,\sum \limits_{k = 1}^{K_1} \mathbf W_{K_2k} h_{k} \right]\]

Let’s think about taking the derivative w.r.t. one element of \(\mathbf W\) - \(\mathbf W_{1,1}\)

\[q_1 = \mathbf W_{1,1} h_1 + \mathbf W_{1,2} h_2 + ... + \mathbf W_{1,K_1} h_{K_1}\]

\[q_2 = \mathbf W_{2,1} h_1 + \mathbf W_{2,2} h_2 + ... + \mathbf W_{2,K_1} h_{K_1}\]

\(\mathbf W_{1,1}\) only shows up once!

Computational Graphs

Now, let’s think of the partial of \(\mathbf q_{i2}\) w.r.t. \(\mathbf W_{1,1}\)

\[\frac{\partial \mathbf q_{i2}}{\partial \mathbf W_{1,1}} = \left[h_{1}, 0 , 0 , 0 , ... , 0 \right]^T\]

\[\frac{\partial \mathbf q_{i2}}{\partial \mathbf W_{1,2}} = \left[h_{2}, 0 , 0 , 0 , ... , 0 \right]^T\]

\[\frac{\partial \mathbf q_{i2}}{\partial \mathbf W_{2,1}} = \left[0, h_1 , 0 , 0 , ... , 0 \right]^T\]

Computational Graphs

Generic rule:

\[\mathbf v_{r,c}^T = \frac{\partial \mathbf q_{i}}{\partial \mathbf W_{r,c}} = \left[\underset{1}{0}, \underset{2}{0} , ... , \underset{r}{h_c} , ... , \underset{K_2}{0} \right]^T\]

  • A sparse vector of length \(K_2\)

Now, let’s grab the gradient w.r.t. the loss

\[\frac{\partial \mathcal L_i}{\partial \mathbf W_{2: r,c}} = \underset{(1 \times 1)}{[\sigma(z_i) - y_i]} \otimes \underset{(1 \times K_2)}{\bb^T} \odot \underset{(1 \times K_2)}{I(\mathbf q_{i2} > 0)} \otimes \underset{(K_2 \times 1)}{v_{r,c}}\]

Computational Graphs

Let \(\mathbf u\) represent the upstream gradient:

\[\underset{(1 \times K_2)}{\mathbf u} = \underset{(1 \times 1)}{[\sigma(z_i) - y_i]} \otimes \underset{(1 \times K_2)}{\bb^T} \odot \underset{(1 \times K_2)}{I(\mathbf q_{i2} > 0)}\]

\[\underset{K_2 \times 1}{\mathbf v_{r,c}} = \frac{\partial \mathbf q_{i}}{\partial \mathbf W_{r,c}} = \left[\underset{1}{0}, \underset{2}{0} , ... , \underset{r}{h_c} , ... , \underset{K_2}{0} \right]^T\]

\[\frac{\partial \mathcal L_i}{\partial \mathbf W_{2: r,c}} = \mathbf u \mathbf v_{r,c} = u_r h_c\]

Computational Graphs

\[ \frac{\partial \mathcal L_i}{\partial \mathbf W_{2}} = \begin{bmatrix} u_1 h_1 & u_1 h_2 & \ldots & u_1 h_{K_1}\\ u_2 h_1 & u_2 h_2 & \ldots & u_2 h_{K_1}\\ \vdots & \vdots & \ddots & \vdots \\ u_{K_2} h_1 & u_{K_2} h_2 & \ldots & u_{K_2} h_{K_1}\\ \end{bmatrix} \]

Computational Graphs

Put more succinctly, we can define the gradient as:

\[ \frac{\partial \mathcal L_i}{\partial \mathbf W_{2}} = \underset{(K_2 \times 1)}{\mathbf u^T} \times \underset{(1 \times K_1)}{\mathbf h_{i1}^T} \]

All that work to get something workable!

  • 4 out of 6 gradients derived!

  • 2 more to go…
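The outer-product form is easy to check numerically. A minimal NumPy sketch (the widths and the values of \(\mathbf u\) and \(\mathbf h\) are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
K1, K2 = 4, 3                 # made-up widths
u = rng.standard_normal(K2)   # upstream gradient, one entry per unit of q_2
h = rng.standard_normal(K1)   # hidden activations h_1

# dL/dW_2 = u^T h^T: a (K2 x K1) outer product
grad = np.outer(u, h)

# Entry (r, c) is u_r * h_c, matching the entry-by-entry derivation
assert np.isclose(grad[1, 2], u[1] * h[2])
print(grad.shape)  # (3, 4)
```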

Computational Graphs

Unfortunately, we’ve still got work to do!

Or do we?

Computational Graphs

\[\mathcal L(\mathbf z | \mathbf X, \mathbf y) = -\expn \left[ y_i \log \sigma(z_i) + (1 - y_i) \log(1 - \sigma(z_i)) \right]\]

\[\underset{(1 \times 1)}{z_i} = \underset{(1 \times K_2)}{\bb^T} \underset{(K_2 \times 1)}{\mathbf h_{i2}} + \underset{(1 \times 1)}{a}\]

\[\underset{(K_2 \times 1)}{\mathbf h_{i2}} = \varphi(\underset{(K_2 \times 1)}{\mathbf q_{i2}})\]

\[\underset{(K_2 \times 1)}{\mathbf q_{i2}} = \underset{(K_2 \times K_1)}{\mathbf W_2^T}\underset{(K_1 \times 1)}{\mathbf h_{i1}} + \underset{(\mathbf K_2 \times 1)}{\mathbf b_2}\]

\[\underset{(K_1 \times 1)}{\mathbf h_{i1}} = \varphi(\underset{(K_1 \times 1)}{\mathbf q_{i1}})\]

\[\underset{(K_1 \times 1)}{\mathbf q_{i1}} = \underset{(K_1 \times P)}{\mathbf W_1^T}\underset{(P \times 1)}{\mathbf x_i} + \underset{(\mathbf K_1 \times 1)}{\mathbf b_1}\]
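The forward pass above can be sketched in a few lines of NumPy. The sizes, seed, and variable names here are made up, but the computation mirrors the chain of equations (bottom to top):

```python
import numpy as np

rng = np.random.default_rng(0)
P, K1, K2 = 5, 4, 3                       # made-up layer sizes

x = rng.standard_normal(P)                # one observation
W1 = rng.standard_normal((P, K1)); b1 = np.zeros(K1)
W2 = rng.standard_normal((K1, K2)); b2 = np.zeros(K2)
beta = rng.standard_normal(K2); a = 0.0

relu = lambda q: np.maximum(q, 0.0)       # varphi
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Forward pass, bottom equation first
q1 = W1.T @ x + b1                        # (K1,)
h1 = relu(q1)                             # (K1,)
q2 = W2.T @ h1 + b2                       # (K2,)
h2 = relu(q2)                             # (K2,)
z = beta @ h2 + a                         # scalar
p = sigmoid(z)                            # Pr(y_i = 1)
```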

Computational Graphs

At this point, we have \(\frac{\partial \mathcal L_i}{\partial \mathbf h_{i1}}\):

\[\frac{\partial \mathcal L_i}{\partial \mathbf h_{i1}} = \underset{(1 \times 1)}{[\sigma(z_i) - y_i]} \otimes \underset{(1 \times K_2)}{\bb^T} \odot \underset{(1 \times K_2)}{I(\mathbf q_{i2} > 0)} \otimes \underset{(K_2 \times K_1)}{\mathbf W_2^T}\]

Computational Graphs

Next is \(\frac{\partial \mathbf h_{i1}}{\partial \mathbf q_{i1}}\).

\[\underset{(K_1 \times 1)}{\mathbf h_{i1}} = \varphi(\underset{(K_1 \times 1)}{\mathbf q_{i1}})\]

We already know what this is going to be!

A diagonal matrix with \(I(\mathbf q_{i1} > 0)\) along the diagonal

\[\frac{\partial \mathcal L_i}{\partial \mathbf q_{i1}} = \underset{(1 \times 1)}{[\sigma(z_i) - y_i]} \otimes \underset{(1 \times K_2)}{\bb^T} \odot \underset{(1 \times K_2)}{I(\mathbf q_{i2} > 0)} \otimes \underset{(K_2 \times K_1)}{\mathbf W_2^T} \odot \underset{(1 \times K_1)}{I(\mathbf q_{i1} > 0)}\]

Computational Graphs

One layer to go!

\[\underset{(K_1 \times 1)}{\mathbf q_{i1}} = \underset{(K_1 \times P)}{\mathbf W_1^T}\underset{(P \times 1)}{\mathbf x_i} + \underset{(\mathbf K_1 \times 1)}{\mathbf b_1}\]

Three derivatives here:

  • \(\frac{\partial \mathbf q_{i1}}{\partial \mathbf W_1}\) (Terminal)

  • \(\frac{\partial \mathbf q_{i1}}{\partial \mathbf x_i}\) (Terminal)

  • \(\frac{\partial \mathbf q_{i1}}{\partial \mathbf b_1}\) (Terminal)

We’ve seen these derivative forms before…

Computational Graphs

\[ \frac{\partial \mathbf q_{i1}}{\partial \mathbf b_1} = \begin{bmatrix} 1 & 0 & \ldots & 0\\ 0 & 1 & \ldots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \ldots & 1\\ \end{bmatrix} \]

  • A \(K_1 \times K_1\) identity matrix

\[\frac{\partial \mathcal L_i}{\partial \mathbf b_{1}} = \underset{(1 \times 1)}{[\sigma(z_i) - y_i]} \otimes \underset{(1 \times K_2)}{\bb^T} \odot \underset{(1 \times K_2)}{I(\mathbf q_{i2} > 0)} \otimes \underset{(K_2 \times K_1)}{\mathbf W_2^T} \odot \underset{(1 \times K_1)}{I(\mathbf q_{i1} > 0)}\]

Computational Graphs

\[\underset{(K_1 \times 1)}{\mathbf q_{i1}} = \underset{(K_1 \times P)}{\mathbf W_1^T}\underset{(P \times 1)}{\mathbf x_i} + \underset{(\mathbf K_1 \times 1)}{\mathbf b_1}\]

Derivative w.r.t. \(\mathbf x_i\) uses matrix derivative rule

\[\frac{\partial \mathcal L_i}{\partial \mathbf x_{i}} = \underset{(1 \times 1)}{[\sigma(z_i) - y_i]} \otimes \underset{(1 \times K_2)}{\bb^T} \odot \underset{(1 \times K_2)}{I(\mathbf q_{i2} > 0)} \otimes \underset{(K_2 \times K_1)}{\mathbf W_2^T} \odot \underset{(1 \times K_1)}{I(\mathbf q_{i1} > 0)} \otimes \underset{(K_1 \times P)}{\mathbf W_1^T}\]

Computational Graphs

Gradient of the loss w.r.t. \(\mathbf W_1\) uses the outer product rule for linear layers:

\[ \mathbf u = \underset{(1 \times 1)}{[\sigma(z_i) - y_i]} \otimes \underset{(1 \times K_2)}{\bb^T} \odot \underset{(1 \times K_2)}{I(\mathbf q_{i2} > 0)} \otimes \underset{(K_2 \times K_1)}{\mathbf W_2^T} \odot \underset{(1 \times K_1)}{I(\mathbf q_{i1} > 0)} \]

\[ \frac{\partial \mathcal L_i}{\partial \mathbf W_1} = \mathbf u^T \times \mathbf x_{i}^T \]
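The full backward pass can be written out by hand and sanity-checked with a finite difference. This is an illustrative sketch (sizes and values are made up); note that it stores each gradient in the shape of the weight matrix itself, i.e., the transpose of the \(\mathbf u^T \mathbf x_i^T\) layout above:

```python
import numpy as np

rng = np.random.default_rng(1)
P, K1, K2 = 5, 4, 3                    # made-up sizes
x = rng.standard_normal(P); y = 1.0
W1 = rng.standard_normal((P, K1)); b1 = rng.standard_normal(K1)
W2 = rng.standard_normal((K1, K2)); b2 = rng.standard_normal(K2)
beta = rng.standard_normal(K2); a = 0.1

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def forward(W1):
    q1 = W1.T @ x + b1; h1 = np.maximum(q1, 0.0)
    q2 = W2.T @ h1 + b2; h2 = np.maximum(q2, 0.0)
    return q1, h1, q2, h2, beta @ h2 + a

q1, h1, q2, h2, z = forward(W1)

# Backward pass: accumulate the upstream gradient u one factor at a time
u = (sigmoid(z) - y) * beta            # (K2,)   dL/dh2
u = u * (q2 > 0)                       # (K2,)   dL/dq2
grad_b2 = u
grad_W2 = np.outer(h1, u)              # (K1,K2) outer-product rule
u = W2 @ u                             # (K1,)   dL/dh1
u = u * (q1 > 0)                       # (K1,)   dL/dq1
grad_b1 = u
grad_W1 = np.outer(x, u)               # (P,K1)  outer-product rule

# Sanity check one entry of grad_W1 against a finite difference
def loss(W1):
    z = forward(W1)[-1]
    return -(y * np.log(sigmoid(z)) + (1 - y) * np.log(1.0 - sigmoid(z)))

eps = 1e-6
W1p = W1.copy(); W1p[2, 1] += eps
assert np.isclose(grad_W1[2, 1], (loss(W1p) - loss(W1)) / eps, atol=1e-4)
```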

Computational Graphs

We have all of the rules!

As long as we know what kind of layer we’re computing, we know that the gradient/Jacobian has a known form!

Computational Graphs


PyTorch is just an autodiff machine

  • Interpret the user-defined model and build the corresponding computational graph

  • Find the appropriate forward pass functions

  • Find the appropriate backwards pass gradient functions

  • On a backwards pass, multiply the gradients/Jacobians to get the SGD update step

This software just optimizes the differentiation routine for fitting models via the chain rule!
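The core idea can be shown with a toy scalar reverse-mode autodiff class. The `Value` class below is purely illustrative (this is not how PyTorch is actually implemented, and it only handles tree-shaped graphs):

```python
class Value:
    """A scalar node in a computational graph -- a toy, not PyTorch's design."""
    def __init__(self, data, parents=()):
        self.data = data              # forward-pass value
        self.grad = 0.0               # filled in on the backward pass
        self.parents = parents        # (parent_node, local_gradient) pairs

    def __add__(self, other):
        return Value(self.data + other.data,
                     parents=((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Value(self.data * other.data,
                     parents=((self, other.data), (other, self.data)))

    def backward(self):
        # Push upstream * local gradients down the graph.
        # (Fine for tree-shaped graphs; a real autodiff engine
        # processes nodes in reverse topological order.)
        self.grad = 1.0
        stack = [self]
        while stack:
            node = stack.pop()
            for parent, local in node.parents:
                parent.grad += node.grad * local
                stack.append(parent)

# z = w * x + b, so dz/dw = x, dz/dx = w, dz/db = 1
w, x, b = Value(2.0), Value(3.0), Value(1.0)
z = w * x + b
z.backward()
print(w.grad, x.grad, b.grad)  # 3.0 2.0 1.0
```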

Computational Graphs

For any neural network, you could program up your own function that does forward and backwards passes

  • It is really as simple as writing the constituent functions and evaluating given the knowns or unknowns

At scale, though, your function is going to be walking through mud compared to PyTorch!

  • Blazing fast

  • Compiled C/C++ under the hood

  • A lot of very nice utility functions

Problem Gradients

With this setup, we can train arbitrarily complex neural networks

  • Large width

  • Large depth

Why don’t we do this all the time?

  • Why do we need to consider more tricks of the trade if these methods will work arbitrarily well with enough hidden units and layers?

Problem Gradients

\[ \mathbf u = \underset{(1 \times 1)}{[\sigma(z_i) - y_i]} \otimes \underset{(1 \times K_2)}{\bb^T} \odot \underset{(1 \times K_2)}{I(\mathbf q_{i2} > 0)} \otimes \underset{(K_2 \times K_1)}{\mathbf W_2^T} \odot \underset{(1 \times K_1)}{I(\mathbf q_{i1} > 0)} \]

\[ \frac{\partial \mathcal L_i}{\partial \mathbf W_1} = \mathbf u^T \times \mathbf x_{i}^T \]

Note that the gradient for the loss w.r.t. \(\mathbf W_1\) depends on the values of \(\mathbf W_2\)

Problem Gradients

If we extend this out to a NN of arbitrary depth, we get a function of the form:

\[ \mathbf u = [\sigma(z_i) - y_i] \otimes \bb^T \odot I(\mathbf q_{iD} > 0) \otimes \mathbf W_{D}^T \odot I(\mathbf q_{i(D - 1)} > 0) \otimes \mathbf W_{D-1}^T \ldots \odot I(\mathbf q_{i1} > 0) \otimes \mathbf W_{1}^T \]

  • A chain of activation functions and layer weights

Because the gradient of the weights at a layer is a function of a product of all of the previous weights, we can run into issues that arise due to chains of multiplication

Problem Gradients

Let \(\epsilon\) be a small positive value. As the depth \(D\) grows:

\[(1 + \epsilon)^D \rightarrow \infty\]

\[(1 - \epsilon)^D \rightarrow 0\]

The key point is that the deeper the network, the smaller \(\epsilon\) can be while still causing the gradient to explode or vanish.
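A two-line check of the limits above (using \(D = 100\) layers and \(\epsilon = 0.1\)):

```python
# Products of per-layer factors explode or vanish quickly with depth
eps, D = 0.1, 100
print((1 + eps) ** D)   # ~1.4e4: explodes
print((1 - eps) ** D)   # ~2.7e-05: vanishes
```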

Problem Gradients

Gradient explosion is a problem that arises when we have a lot of weights (after passing through the nonlinear activation) that are generally greater than 1.

  • This corresponds to locations in the loss surface where there are steep drops

Problem Gradients


We shoot way off into the distance and will never return…

Two times we see this happen (most frequently):

  • Really deep models with a lot of multiplication of weights

  • Bad starting values

Next time, we’ll talk about residual connections, which can help to mitigate the first one

Problem Gradients

One easy way to fix the exploding gradient problem is to use a method called gradient clipping

The standard SGD step:

\[\boldsymbol \theta_t = \boldsymbol \theta_{t - 1} - \eta \mathbf g(\theta_{t - 1})\]

where the gradient sets the optimal direction of descent

  • Note that the gradient is a vector in the loss space that sets the direction

  • The size is also set by the gradient to try to maximize the “benefit” of a single step

  • As long as we’re moving in the right direction, we can temper the size and be okay

Problem Gradients

Gradient clipping caps the amount with which we can move in a single step:

\[\boldsymbol \theta_t = \boldsymbol \theta_{t - 1} - \eta \mathbf g'(\theta_{t - 1})\]

\[\mathbf g'(\theta_{t - 1}) = \text{min}\left(1 , \frac{c}{\|g(\theta_{t - 1})\|} \right)g(\theta_{t - 1})\]

where \(c\) is some positive value.

  • Set to 3? Depends on the problem.
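The clipping rule is a one-liner in NumPy. A minimal sketch (the gradient values are made up):

```python
import numpy as np

def clip_gradient(g, c):
    """Rescale g so its L2 norm is at most c (clipping by global norm)."""
    norm = np.linalg.norm(g)
    if norm > c:
        g = g * (c / norm)
    return g

g = np.array([30.0, 40.0])            # ||g|| = 50: far too big a step
print(clip_gradient(g, 3.0))          # rescaled to norm 3: [1.8 2.4]
print(clip_gradient(g / 100.0, 3.0))  # norm 0.5 <= 3: returned unchanged
```

In PyTorch, `torch.nn.utils.clip_grad_norm_` applies this same rescaling across all of a model's parameters at once.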

Problem Gradients

Additionally, try to ensure that the input features are normalized to exist on a small subset of the real line close to zero

  • Standardize the variance

  • Scale to be between 0 and 1

Helps to ensure that the corresponding coefficients are small.
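Both rescalings are one-liners in NumPy. A sketch with made-up raw feature values:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))  # raw features, far from 0

# Standardize: mean 0, variance 1 per column
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Or min-max scale each column to [0, 1]
X_01 = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```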

Problem Gradients

Another problem arises when we’re choosing starting values.

Choosing starting values for unknowns is a required part of GD routines.

  • We have to start the unknowns somewhere!

A common way to initialize the procedure is to set each value of the hidden weights matrix to a random independent draw from a normal distribution.

  • In the initial forward pass, use these weights to create the initial set of hidden units values

Problem Gradients

Let’s assume that each column of \(\mathbf X\) has mean 0 and variance 1, with \(\mathbf x_i \in \mathbb R^P\).

Randomly draw each value of \(\mathbf W_1\) from a standard normal, as well.

Given an initial \(\mathbf W_1\) (a \(P \times K_1\) matrix), we can compute the set of \(K_1\) hidden values for an observation as:

\[\mathbf h_{i1} = \mathbf W_1^T \mathbf x_i\]
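We can simulate exactly this setup to see what goes wrong. A NumPy sketch (sizes and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, K1 = 2_000, 512, 512
X = rng.standard_normal((N, P))       # normalized features: mean 0, variance 1
W1 = rng.standard_normal((P, K1))     # naive N(0, 1) initialization

H1 = X @ W1                           # rows are h_{i1}^T = x_i^T W_1
print(H1.var())                       # ~512: the variance equals P
```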

Problem Gradients

512 Hidden Units

Problem Gradients

The variance of the values is blowing up!

  • Big hidden unit values means big weights and exploding gradients!

For a normalized feature set, the variance of the starting values for the hidden units is:

\[P \text{ or } N_{In}\]

since each hidden unit is a sum of \(P\) products of independent, mean-zero, unit-variance terms.

  • \(N_{In}\) is the number of incoming connections to a hidden node.

Problem Gradients

On the other side, the variance of the gradient is going to explode as well depending on the number of outgoing connections. If \(\mathbf h_1\) connects to \(\mathbf h_2\) via:

\[\mathbf h_2 = \mathbf W_2^T \mathbf h_1\]

then the gradient of the loss is a function of the outgoing layer weights:

\[\frac{\partial \mathcal L_i}{\partial \mathbf h_{id}} = \mathbf u_i \mathbf W_{d + 1}^T\]

where \(\mathbf u_i\) is a \(1 \times K_{d + 1}\) vector and \(\mathbf W_{d + 1}^T\) is a \(K_{d + 1} \times K_d\) matrix.

Problem Gradients

512 Hidden Units

Problem Gradients

A smart initialization scheme is Glorot (Xavier) initialization

Given the number of incoming and outgoing connections affected by \(\mathbf W_k\), initialize the matrix to be random draws from a normal distribution with mean 0 and variance:

\[\sigma^2_{W_k} = \frac{2}{N_{In} + N_{Out}}\]

which drags down the initial variance of the hidden values!

  • Light twists on this scheme are the default in PyTorch!
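Rerunning the earlier simulation with Glorot draws shows the fix. Again, sizes and seed are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_in, n_out = 2_000, 512, 512
X = rng.standard_normal((N, n_in))

# Glorot initialization: normal draws with variance 2 / (N_in + N_out)
sigma = np.sqrt(2.0 / (n_in + n_out))
W = rng.normal(0.0, sigma, size=(n_in, n_out))

H = X @ W
print(H.var())    # ~1: the hidden values start on a reasonable scale
```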

Problem Gradients

On the other side, we also have the issue of vanishing gradients

  • The gradients for different parameters get stuck at small values due to architectural choices

  • We can’t escape from the suboptimal part of the loss space

These are a little harder to deal with

  • Most commonly dead ReLUs

We’ll start here next time!